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Abstract 



We give efficient algorithms for volume sampling, i.e., for picking fc-subsets of the rows of 
any given matrix with probabilities proportional to the squared volumes of the simplices defined 
by them and the origin (or the squared volumes of the parallelepipeds defined by these subsets 
of rows). This solves an open problem from the monograph on spectral algorithms by Kannan 
and Vempala (see Section 7.4 of [TS], also implicit in [HIS]). 

Our first algorithm for volume sampling fc-subsets of rows from an m-hy-n matrix runs in 
0{kmn^ logn) arithmetic operations and a second variant of it for (1 + e)-approximate volume 
sampling runs in 0(mnlogm • k^/e^ + mlog'^m • fc^"+^/e^'^ • log(fce^^ logm)) arithmetic op- 
erations, which is almost linear in the size of the input (i.e., the number of entries) for small 
k. 

Our efficient volume sampling algorithms imply the following results for low-rank matrix 
approximation: 

1. Given A G M™^", in 0(fcmri" log n) arithmetic operations we can find fc of its rows such 
that projecting onto their span gives a vfc+l-approximation to the matrix of rank k clos- 
est to A under the Frobenius norm. This improves the 0{ky/\og fc)-approximation of Bout- 
sidis, Drineas and Mahoney [1] and matches the lower bound shown in [5j. The method of 
conditional expectations gives a deterministic algorithm with the same complexity. The 
running time can be improved to 0{mn \ogm-k'^/e^+m log^ TO-fc^"+-'^/e^" dog(fce~^ log m)) 
at the cost of losing an extra (1 + e) in the approximation factor. 

2. The same rows and projection as in the previous point give a ■\/(fc + l)(n — fc')-approximation 
to the matrix of rank fc closest to A under the spectral norm. In this paper, we show an 
almost matching lower bound of ^/n, even for fc = 1. 

Keywords: volume sampling, low-rank matrix approximation, row/column subset selection 

1 Introduction 

Volume sampling, i.e., picking A;-subsets of the rows of any given matrix with probabilities pro- 
portional to the squared volumes of the simplicies defined by them, was introduced in [5] in the 
context of low-rank approximation of matrices. It is equivalent to sampling fe-subsets of {1, ... , m} 
with probabilities proportional to the corresponding k hy k principal minors of any given m hy m 
positive semidefinite matrix. 

In the context of low-rank approximation, volume sampling is related to a problem called 
row/ column- sub set selection [1]. Most large data sets that arise in search, microarray experiments, 
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computer vision, data mining etc. can be thought of as matrices where rows and columns are 
indexed by objects and features, respectively (or vice versa), and we need to pick a small subset 
of features that are dominant. For example, while studying gene expression data biologists want 
a small subset of genes that are responsible for a particular disease. Usual dimension reduction 
techniques such as principal component analysis (PCA) or random projection fail to do this as 
they typically output singular vectors or random vectors which are linear combinations of a large 
number of feature vectors. A recent article by Mahoney and Drineas [18] highlights the limitations 
of PCA and gives experimental data on practical applications of low-rank approximation based on 
row/column-subset selection. 

2 Row/column-subset selection and volume sampling 

While dealing with large matrices in practice, we seek smaller or low-dimensional representations 
of them which are close to them but can be computed and stored efficiently. A popular notion 
for low-dimensional representation of matrices is low-rank matrices, and the most popular metrics 
used to measure the closeness of two matrices are the Frobenius or Hilbert-Schmidt norm (i.e., the 
square root of the sum of squares of entries of their difference) and the spectral norm (i.e., the 
largest singular value of their difference). The singular value decomposition (SVD) tells us that 
any matrix A G can be written as 

m 

A = ^aiUivf, 

i=l 

where cJi > . . . > am > 0, £ are orthonormal and Vi £ are ort honor mal. Moreover, the 
nearest rank- A; matrix to A, let us call it A/^, under both the Frobenius and the spectral norm, is 
given by 

k 
i=l 

In other words, the rows of are projections of the rows of A onto span (uj : 1 < i < A;). Be- 
cause of this, most dimension reduction techniques based on the singular value decomposition, e.g., 
principal component analysis (PCA), are interpreted as giving Vi's as the dominant vectors, which 
happen to be linear combinations of a large number of the rows or feature vectors of A. 

The row-subset selection problem we consider in this paper is: Can we pick a /c-subset of the 
rows (or feature vectors) of ^ G j^mxn i^j-^g^^ projecting onto their span is almost as good as 
projecting onto span(?;j : 1 < i < k)7 

Several row-sampling techniques have been considered in the past as an approximate but faster 
alternative to the singular value decomposition, in the context of streaming algorithms and large 
data sets that cannot be stored in random access memory [SI [3 [5]. The first among these is the 
squared-length sampling of rows introduced by Frieze, Kannan and Vempala [8]. Another sampling 
scheme due to Drineas, Mahoney and Muthukrishnan [7] uses the singular values and singular 
vectors to decide the sampling probabilities. Later Deshpande, Rademacher, Vempala and Wang 
[5] introduced volume sampling as a generalization of squared-length sampling. 

Definition 1. Given A E M™^", volume sampling is defined as picking a k-subset S of [m] with 
probability proportional to 

det {AsA^) = {k\ ■ vol (conv ({0} U {a^ : z G S]))f , 
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where ai denotes the i-th row of A, As G MJ'^'^ denotes the row-submatrix of A given by rows with 
indices i £ S, and conv (•) denotes the convex hull. 

The application of volume sampling to low-rank approximation and, more importantly, to the 
row-subset selection problem, is given by the following theorem shown in |5]. It says that picking 
a subset of k rows according to volume sampling and projecting all the rows of A onto their span 
gives a, (k + l)-approximation to the nearest rank-A; matrix to A. 



Theorem 2. Given any A £ 



l^-^s(^)|| 



< (A; + 1) \\A-A 



kUF ' 



when S is picked according to volume sampling, irs{A) G ]^"^x" denotes the matrix obtained by 
projecting all the rows of A onto span(aj : i £ S), and A^ is the matrix of rank k closest to A 
under the Frobenius norm. 

As we will see later, this easily implies 

E[\\A-7Ts{A)\\^] < y^{k + l){n-k)\\A-Ak\\2. 

Theorem [2] gives only an existence result for row-subset selection and we also know a matching 
lower bound that says this is the best we can possibly do. 

Theorem 3. ,fj/ For any e > 0, there exists a matrix A G mC^+i)^'^ such that picking any k-subset 
S of its rows gives 

P-7r5(^)||^ > {l-e)VkTT\\A-Ak\\F. 

However, no efficient algorithm was known for volume sampling prior to this work. An algorithm 
mentioned in Deshpande and Vempala [6] does A;!-approximate volume sampling in time 0{kmn), 
which means that plugging it in Theorem [2] can only guarantee (k -\- l)!-approximation instead of 
(A; -|- 1). Finding an efficient algorithm for volume sampling is mentioned as an open problem in 
the recent monograph on spectral algorithms by Kannan and Vempala (see Section 7.4 of |15j). 

Boutsidis, Drineas and Mahoney [1] gave an alternative approach to row-subset selection (with- 
out going through volume sampling) and here is a re-statement of the main theorem from their 
paper which uses columns instead of rows. 

Theorem 4. JJ/ For any A G M™^", a k-subset S of its rows can be found in time O (minjmn^, m^n}) 
such that 

\\A - 7rs{A)\\p = 0{k./hgk) \\A - AkWp 

\\A - ns{A)\\^ = O (fc3/4(n - k)^/^^/l^) \\A - A^W^ ■ 

Row/column-subset selection problem is related to rank-revealing decompositions considered 
in linear algebra \12\ [T9] , and the previous best algorithmic result for row-subset selection in the 
spectral norm case was given by a result of Gu and Eisenstat [12] on strong rank-revealing QR 
decompositions. The following theorem is a direct consequence of [12] as pointed out in [1]. 
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Theorem 5. Given A S M™'^", an integer k < n and / > 1, there exists a k-subset S of the 
columns of A such that 



In the context of volume samphng, it is interesting to note that Pan [19] has used an idea 
of picking submatrices of locally maximum volume (or determinants) for rank-reveahng matrix 
decompositions. We refer the reader to [19] for details. 

The results of Goreinov, Tyrtyshnikov and Zamarashkin [lO^lllj on pseudo-skeleton approxima- 
tions of matrices look at submatrices of maximum determinants as good candidates for row/column- 
subset selection. 



V ^21 A22 J ' 

where An G R'^^^ is the k by k submatrix of A of maximum determinant. Then, 

max 1(^22 - Ai2A^^A2i)ij \ < {k + 1) \\A - Ak\\2 ■ 
hi 

Because of this relation between row/column-subset selection and the related ideas about picking 
submatrices of maximum volume, Qivril and Magdon-Ismail [H S] looked at the problem of picking 
a fc-subset 5 of rows of a given matrix A S K'^x" such that det (^5^^) is maximized. They show 
that this problem is NP-hard [3] and moreover, it is NP-hard to even approximate it within a factor 
of 2^^^ , for some constant c > [5]. This is interesting in the light of our results because we show 
that even though finding the row-submatrix of maximum volume is NP-hard, we can still sample 
them with probabilities proportional to their volumes in polynomial time. 

2.1 Our results 

Our main result is a polynomial time algorithm for exact volume sampling. In Section HI we give 
an outline of our Algorithm [H followed by two possible subroutines given by Algorithms [2] and [3] 
that could be plugged into it. 

Theorem 7 (polynomial-time volume sampling). The randomized algorithm given by the combi- 
nation of the algorithm outlined in Algorithm\^ with Algorithm\M as its subroutine, when given a 
matrix A G ]K™x" and an integer 1 < k < rank(74), outputs a random k-subset of the rows of A 
according to volume sampling, using 0{kmn'^ logn) arithmetic operations. 

The basic idea of the algorithm is as follows: instead of picking a fe-subset, pick an ordered k- 
tuple of rows according to volume sampling (i.e., volume sampling suitably extended to all fc-tuples 
such that for any fixed A:-subset, all its kl permutations are all equally likely). We observe that the 
marginal distribution of the first coordinate of such a random tuple can be expressed in terms of 
coefficients of the characteristic polynomials of AA'^ and BiBj , where Bi G ]^™x" ig the matrix 
obtained by projecting each row of A orthogonal to the i-th row Oj. Using this interpretation, it is 




Theorem 6. If A^W 



mxn 



can be written as 
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easy to sample the first index of the A;-tuple with the right marginal probability. Now we project 
the rows of A orthogonal to the chosen row and repeat to pick the next row, until we have picked 
k of them. 

The algorithm just described informally, if implemented as stated, would have a polynomial 
dependence in m, n and k, for some low-degree polynomial. We can do better and get a linear 
dependence in m by working with A^A in place of AA"^ and computing the projected matrices 
using rank-1 updates (Theorem [7]) , while still having a polynomial time guarantee and sampling 
exactly. It would be even faster to perform rank-1 updates to the characteristic polynomial itself, 
but that requires the computation of the inverse of a polynomial matrix (Proposition llSp . and it 
is not clear to us at this time that there is a fast enough exact algorithm that works for arbitrary 
matrices. Jeannerod and Villard [13] give an algorithm to invert a generic n-hy-n matrix with 
entries of degree d, with n a power of two, in time 0{n^d). This would lead to the computation 
of all marginal probabilities for one row in time 0{n^ + mn'^) (a variation of Algorithm [3] and its 
analysis) . 

Instead, if we are willing to be more practical, while sacrificing our guarantees, then we can 
perform rank-1 updates to the characteristic polynomial by using the singular value decomposition 
(SVD). In [9], an algorithm with cost 0(min{?7in^, m^n}) arithmetic operations is given for the 
singular value decomposition but the SVD cannot be computed exactly and we do not know how 
its error propagates in our algorithm which uses many such computations. If the SVD of an m- 
by-n matrix can be computed in time 0{Ts^d)j this leads to a nearly-exact algorithm for volume 
sampling in time 0{kTs^d + kmn'^). See Proposition [18] for details. 

Volume sampling was originally defined in [5] to prove Theorem [21 in particular, to show that 
any matrix A contains k rows in whose span lie the rows of a rank-fc approximation to A that 
is no worse than the best in the Frobenius norm. Efficient volume sampling leads to an efficient 
selection of k rows that satisfy this guarantee, in expectation. In Section [SJ we use the method 
of conditional expectations to derandomize this selection. This gives an efficient deterministic 
algorithm (Algorithm [J]) for row-subset selection with the following guarantee in the Frobenius 
norm. This guarantee immediately implies a guarantee in the spectral norm, as follows: 

Theorem 8 (deterministic row subset selection). Deterministic Algorithm^ when given a ma- 
trix A G j^J^x" Qfid integer 1 < k < rank(74), outputs a k-subset S of the rows of A, using 
0{kmn^logn) arithmetic operations, such that 

\\A-7rs{A)\\p < VkTT\\A-Ak\\p 
P-vrs(^)|l2 < Vik + l){n-k)\\A-Ak\\2. 

This improves the 0(A;\/IogX)-approximation of Boutsidis, Drineas and Mahoney [1] for the 
Frobenius norm case and matches the lower bound shown in Theorem [3] due to [5]. 

The superlinear dependence on n might be too slow for some applications, while it might 
be acceptable to perform volume sampling or row/column-subset selection approximately. Our 
volume sampling algorithm (Algorithm [T]) can be made faster, while losing on the exactness, by 
using the idea of random projection that preserves volumes of subsets. Magen and Zouzias |17j 
have the following generalization of the Johnson-Lindenstrauss lemma: for m points in M" there 
exists a random projection of them into W^, where d = O (/c^ log m/e^) , that preserves the volumes 
of simplices formed by subsets of k or fewer points within lie. Therefore, we get a (1 it e)- 
approximate volume sampling algorithm that requires 0{mnd)-tim.e to do the random projection 
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(by matrix multiplication) and then 0{mnd^ log d) time for volume sampling on the new m-hy-d 
matrix (according to Theorem [7]). 

Theorem 9 (fast volume sampling). Using random projection for dimensionality reduction, the 
polynomial time algorithm for volume sampling mentioned in Theorem^ (i.e.. Algorithm {1\ with 
AlgorithmlM as its subroutine), gives (1 + e)- approximate volume sampling, using 

O ( mn log m ■ — + m log'^ m ■ — log(fce~"'^ log m) j . 

arithmetic operations. 

Finally, we show a lower bound for row/column-subset selection in the spectral norm that almost 
matches our upper bound in terms of the dependence on n. 

Theorem 10 (lower bound). There exists a matrix A G such that 

\\A - Tr^iy{A)\\^ = 0,{^/n) \\A - Ai\\2 , for all 1 < i < n, 

where 7r|jj(^) G is the matrix obtained by projecting each row of A onto the span of its 

i-th row Ui. 



3 Preliminaries and notation 

For m G N, let [m] denote the set {1, . . . ,m}. For any matrix A G R*"^"-, we denote its rows by 
oi, a2, . . . , Om S K'^- For S C [m], let As be the row-submatrix of A given by the rows with indices 
in 5. By span (S) we denote the linear span of {oj : i £ S} and let TTsiA) G R™-^" be the matrix 
obtained by projecting each row of A onto span(S'). Hence, A — Trs{A) G R™^" is the matrix 
obtained by projecting each row of A orthogonal to span (S). 

Throughout the paper we assume m > n. This assumption is not needed most of the time, 
but justifies sometimes working with Ai'- A instead of AA^ and, more generally, some choices in 
the design of our algorithms. It is also partially justified by our use of a random projection as a 
preprocessing step that makes n small. 

The singular values of A G R™^" are defined as the positive square-roots of the eigenvalues of 
AA^i^ G R"*^™ (or Ai^ A G R"^", up to some extra singular values equal to zero), and we denote 
them by o"! > (T2 > • • • > dm > 0. Well-known identities of the singular values like 

m m 

trace {AA^) = ^ af and det {AA^) = of 

i=l i=l 

can be generalized into the following lemma. 

Lemma 11. (Proposition 3.2 in JSj^) For any A G R™^", 

^ det (^5^^)= Yl 4---4 = |cm-fc(^^^)|, 

SC[m] : \S\=k ii<-<ik 

where cJi, . . . ,am are the singular values of A, i.e., eigenvalues of AA^ , and 

m 

det {xl - AA^) = x"' + Cm-i(AA^)x"'-^ + . . . + cq{AA^) = JJ(x - af), 

i=l 
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is the characteristic polynomial of AA^ . Using det [xl — AA^^ = x"^ "det [xl — Al^ , we can 
alternatively use CYfi—k^AA^^ — Cji—i^i^A^ A^ in the above formula^ for k ^ n. 

Let cj be the exponent of the arithmetic complexity of matrix multiplication. We use that there 
is an algorithm for computing the characteristic polynomial of an n-by-n matrix using 0{n^ logn) 
arithmetic operations O Section 16.6]. 

Here is another lemma that we will need about dividing determinants into products of two 
determinants. 

Lemma 12. Lei A G M™^", 5,rc [m], SnT = $ and B = A - 7Ts{A). Then 

det (AsutA'^ut) = det (AsA^) det {BtB^) . 

Proof. Without loss of generality, we can reduce ourselves to the case where U T is all rows of 
the given matrix: Let C = AsuT e Ml^ur|xn^ D = C - t^s{C). We have D = BsuT- Then what we 
want to prove can be rewritten as: 

det(CC^) = det(C5Cj)det(DT-D?). 

To show this, we consider two cases. If C^Cj is singular, then both sides of the equality are zero. 
If CsCg is invertible, then we can perform block Gaussian elimination and write 

(E F\ (I -E~^F\ _ fE 
\G h)\^ I ) ~ \G D-GE-^F 

f E F\ 

applied to ( ^ HI ~ ^' determinants of the block-triangular matrices gives 

det(C7C7^) = det(CsCj) det(CTC|^ - GtG^ {G sG^)"^ G sG^) . 
Now, the projection of the rows of a matrix K onto the row-space of a matrix L can be written as 

i:l{K) = KL^{LL^)~^L, 
so that Dt = Gt- GrGjiGsGjy^Gs, and 

Dj-oTp = GxGj- — GtG'q [GsG'g) ^GsG-f. 
This completes the proof. □ 

Finally, a well-known lemma about how the determinant of a matrix changes under a rank-1 
update. 

Lemma 13 (matrix determinant lemma). For any invertible M G ^"m-xm ^ W^'', 

det (M + uv"^) = (1 + v'^Ar^u) det (M) . 
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4 Efficient volume sampling algorithms 



We first outline our volume sampling algorithm to convince the reader that volume sampling can 
be done in polynomial time. In the subsequent subsections, we give improved subroutines to get 
faster implementations of the same idea. 

The main idea behind our algorithm is based on Lemma [T3] about the marginal probabilities 
encountered in volume sampling. To explain this, it is more convenient to look at volume sampling 
defined as a distribution on A;-tuples (Xi, X2, . . . , X^) instead of A;-subsets, where each of the kl 
permutations of a fc-subset is equally likely, i.e., for any (^1,^2, ■ ■ ■ ,ik) S \m]^ , 



Pr{Xi =ii,...,Xk = ik) 



det (^{n,.....}4n......} 

fc!i:5cM:|5|=fcdet [AsAl) 



if ii, . . . , ifc are distinct 



otherwise 



Then the marginal probabilities Pr [Xt = i \ Xi = ii, . . . , = it-i) have the following interpre- 
tation in terms of the coefficients of certain characteristic polynomials. 

Lemma 14. Let [ii,... ,it-i) S [m]*""^ such that Pr (Xi = ii, . . . ,Xt-i = it-i) > 0, for a random 
k-tuple {Xi,X2, ■ ■ ■ ,Xfc) from the extended volume sampling over k-tuples. Let S = {ii, . . . ,it-i}, 
B = A - Trs{A) and Ci = B - 7r{j}(5) = A - ■Ksvj{i}{A). Then, 

Pr (X, = z I X, = z„ . . . , X,_, = H.,) K-k^tiQCDl 



{k-t + l)\Cm-k+t-l{BBT)\' 



Proof. 



Pr{Xt = i\ Xi = ii,..., Xt-i = it-i) 

_ Z](jt+i,...,jfe)eM*^~* P''(Xi = n, . . . ,Xt-i = it-i,Xt = i,Xt+i = k+i, ... ,Xk = ik) 
^(«tv,ifc)eM''"*+^ ^'^^ = ^1, • • • ) Xt-i = it-i,Xt = it,Xt+i = it+i, . . . , X^ = ik) 

{k - t)\ J2TC[m] : |5U{i}UT|=fc,|T|=fc-t det M5U{i}UT^5u{j}UT , 



{k-t + 1)! Etch : \suT\=k,\T\=k-t+i det (AsutA'^ut) 



Etch : |S'u{i}uT|=fc,|T|=fc-tdet (^5^5) det (B^iy^xBl 



i}UT 



by Lemma [12] 



{k-t + l) Etch : |suT|=fc,|T|=fc-j+i det {AsA^) det [BrBlf) 

X^TCHl ■ l5'U-t?IUT|=/i; \T\=k t II ^^|| det ^C^C^^ 

~ ' ' — — — -. ™- by Lemma [T2] applied to B 

[k-t + l) Ltch ■■ \svjT\=k,\T\=k-t+i det [BtBj,) 

Etch ■■ \T\=k-t det {CtC^) 



(k-t + l) Etch : \T\=k^t+i det (BtB^) 

\\t>i\\'^ \Cm-k+tiCiC'[)\ , J 

— 7— , ^ ' T^,, by Lemma 111! 

{k-t+l)\Cm-k+t-l{BBT)\ • 



since the extra terms in the sum are all zero 



□ 
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With this lemma in hand, let us consider the following outline of our algorithm. We will later 
give two more efficient implementations of this outline, depending on how the p^'s are computed. 



Algorithm 1. Outline of our volume sampling algorithm 

Input: a matrix A G M™x" and 1 < A; < rank(j4). 

Output: a subset S oi k rows of A picked with probability proportional to det (^AgA'g'j. 

1. Initiahze 5^0 and B A. For t = I to k do: 

(a) For i = 1 to m compute: 

Pi = \\bif ■ \c.m-k+t{CiCl)\ , 

where Ci = B — 7r|j is a matrix obtained by projecting each row of B orthogonal 
to bi. 

(b) Pick i with probability proportional to pi. Let S ^ S U {i} and B ^ C^. 

2. Output S. 



Now we show the correctness of the algorithm: 

Proposition 15. The probability that our volume sampling algorithm outlined above picks a k- 
subset S is proportional to det (yAs^^ ■ This algorithm can be implemented with a cost of 0{km^n+ 
km^~^^ logm) arithmetic operations. 

Proof. By Lemma [T^ for any ii,i2, ■ ■ ■ ,ik such that Pr {Xi = ii, . . . , Xf. = i^), the probability that 
our algorithm picks a sequence of rows indexed ii,i2, ■ ■ ■ ,ik in that order is equal to 

det jjAl".^ .^1 

TTPr(Xt = it\Xi = ii,...,Xt-i =it-i) = Pr{Xi = ii, . . . , Xt = ik) = , ,^ — . ''/i aT\ 

t=i ^- Z2sc[m] : \s\=k det [AsA'sJ 

Otherwise, the probability is zero because in the execution of the algorithm, = for some step 
t. This proves the correctness of our algorithm. 

Given that one can compute the characteristic polynomial of an m-by-m matrix in 0{m^ log m) 
(see Section [3]), our outline can be implemented with the following count of arithmetic operations: 
for every t and i, 0{m?n) to compute CiCj , 0{m'^n + m^logm) in total for pi. Thus, volume 
sampling in 0{km^n + km^^^ log m). □ 

4.1 Efficient volume sampling without SVD 

Here we present the first (faster) subroutine for computing the marginal probabilities pi's within 
the volume sampling algorithm outlined in Section HI The two main ideas behind this subroutine 
are: (1) We can work with B'^B, Cfd € M'"''" instead of BB'^, dCj G M"*^"". Assuming m > n, 
this saves on running time. (2) Each Ci is a rank-1 update of B and therefore, once we have B'^B, 
it can be used to compute all CjCi efficiently. 
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Algorithm 2. First subroutine for marginal probabilities 

Input: B G M'"^^". 
Output: pi,P2, ■ ■ ■ ,Pm- 

For i = 1 to m do: 

1. Compute the matrix CjCi £ R"^" by the following formula 



Lid-B B — ^ 1|, 1,2 + 

Oi Oi Oi 



5^56,6^ b^bTB^Bb^bT 



2. Compute the characteristic polynomial of C/ Cj and output 

Pi = • |cn-fc+t(Cf Ci)| . 



Proposition 16. For any given B £ M'"^", the Algorithmic above computes pi, . . . ,pm in 0{mn^ log n) 
arithmetic operations. 

Proof. B'^B can be computed in time 0{mn'^). Observe that since Cj is obtained by projecting 
each row of B orthogonal to bi, 

Ci = B- -1-^BbibJ, 



and therefore, 



rTr R^R ^^-Sfe^^f bjbjBTB b,bfB^Bb,bJ 

CiCi = B B — ^ 1|, + ^rp — • 

WbiW Oil 6,; 



So once we have B^B, for each i, CfCi can be computed in time 0{n^) and the characteris- 
tic polynomial of CjCi can be computed in time O(n'^logn) [21 Section 16.6]. By Lemma II 1^ 
Cm-k+t{CiCf) = Cn-k+tiCfCi) and hence, the above subroutine results into an 0{kmn'^ logn) 
time algorithm for volume sampling. □ 

Theorem 17 (same as Theorem [7]). The randomized algorithm given by the combination of the 
algorithm outlined in Algorithm{l\with Algorithmic as its subroutine, when given a matrix A G ]R™x" 
and an integer 1 < k < rank(^), outputs a random k-subset of the rows of A according to volume 
sampling, using 0{kmn'^ log n) arithmetic operations. 

Proof. The proof follows by combining Proposition [15] and Proposition [T6l and since we compute 
all the Pi's simultaneously in each round in 0(mn'^ log n) arithmetic operations, the total number 
of arithmetic operations is 0(A;mn^ log n). □ 

4.2 Efficient volume sampling using SVD 

Taking further the idea that each Cj is a rank-1 update of B, we can give a faster algorithm based 
on the singular value decomposition of B. Given the singular value decomposition of a matrix 
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and using the matrix determinant lemma (Lemma [T3]) . one can give a precise formula for how 
the characteristic polynomial changes under a rank-1 update. Using this subroutine in the volume 
sampling algorithm outlined in Section [J] we get an algorithm for nearly-exact volume sampling 
(depending on the precision of the computed SVD) in time Ol^kTsyd + knin^^, where Tgy^ is the 
running time of SVD on an m-hy-n matrix. 



pm 



Algorithm 3. Second subroutine for marginal probabilities. 

Input: B G M™^". 
Output: pi,P2,... ,Pm- 

1. Compute the (thin) singular value decomposition B = UT,V'^, say U G i^"'^" and S, y G 
jgnxm^ and keep the singular values (T2, • • • , Cn and define an+i = ■ ■ ■ = (Jm = 0. Also 
keep the columns of U, i.e., the left singular vectors ui,U2, ■ ■ ■ ,Un G 

2. Compute the polynomial products 

m 

f{x)=Y[{x-af), and 

1=1 

gj{x) = JJ(a; — erf), for all I < j < m. 

3. For i = 1 to m output: 



II!, I|2 
Vi = \\0i\\ 



1 

coefficient of in /(x) H ^'^a'j{uj)fgj{:} 

\\bi\\ 



Proposition 18. In the real arithmetic model and given exact U and S, using the Algorithmic 
as a subroutine inside Algorithmic outlined for volume sampling, we get an algorithm for volume 
sampling. If Tgvd is the running time for computing the singular value decomposition of m-hy-n 
matrices, the algorithm runs in time 0{kTsvd + kmn^). 

Proof. Using the matrix determinant lemma (Lemma I13p . the characteristic polynomial of CiCj 
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can be written as 



det {xl - dCj) = det {xl - BB^ + -^^{Bbi){Bbif^ 

1 + j^bfB'^ixI - BB'^)-^Bh}j det {xl - BB'^) 

1 + -^^hjB'^ixI - ilf^U^y^Bh^j det {xl - BB^) 
by extending U,^,V to get B = UtV^ with C7, S G M™^™ and F G R' 
1 + -^^bfB^U{xI - t^Y^tj^Bb^ det {xl - BB^) 

1 + -^bfvf^ixl - f^y^tV^b,^ det {xl - BB^) 






m 

n 

1=1 


) 




m 

n 

1=1 


x-a] 1 



[x - erf) 



m ^ n 

1=1 11^*11 3=1 l^j 

1 " 



Thus, 



1 

Cm-k+t{CiCl) = coefficient of in f{x) + —2 ^ a]{uj)fgj{x). 

1 1 Oil I j=l 



Once we have the singular value decomposition of B, f[x) and gj{x) can all be computed in time 
0{'n?) using polynomial products. This is because there are at most n non-zero <Tj's. Thus, f{x) 
and all the gj{x) for 1 < j < m can be computed in time 0{m'n?) and then using the above formula 
we get Cm-k+t{CiCl). □ 

4.3 Approximate volume sampling in nearly linear time 

Magen and Zouzias [T7] showed that the random projection lemma of Johnson and Lindenstrauss 
can be generalized to preserve volumes of subsets after embedding. Here is a restatement of Theorem 
1 of [T^ using 0{e/k) instead of e in their original statement. 



Theorem 19. fl^ For any A G M"*^", 1 < k < n and < e < 1/2, there is 

d = 



k"^ log m 
2 
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and there is a mapping f : M" — )• such that 

det {AsA^) < det (AsA^) < (1 + e) det (^5^5) > 



for all S C [m] such that \S\ < k, where A G i^^x*^ fias its i-th row as f{ai). Moreover, f is a 
linear mapping given by multiplication with a random n by d matrix with i.i.d. Gaussian entries, 
so computing A takes time 0{mnd). 

Theorem 20 (same as Theorem [9]). Using random projection for dimensionality reduction, the 
polynomial time algorithm for volume sampling mentioned in Theorem^ (i.e.. Algorithmic with 
Algorithmic as its subroutine), gives (1 + e)- approximate volume sampling, using 

O \ mn log m • — + m log"^ m ■ — log(fce^"'^ log m) j . 

arithmetic operations. 

Proof. Using Theorem [19] and doing volume sampling of fc-subsets of rows from A gives (1 + e)- 
approximation to the volume sampling of fc-subsets of rows from A. This can be done in two steps: 
first, we compute A using matrix multiplication in time 0{mnd) and second, we do volume sampling 
on A using the algorithm from Subsection 14. 1[ Overall, it takes time 0{mnd + kmd^logd), which 
is equal to 

I mn log m ■ — + m log m ■ — ^ 
Moreover, this can be implemented using only one pass over the matrix A with extra space m log m- 

kye\ □ 



5 Derandomized row/column-subset selection 



Our derandomized row-subset selection algorithm is based on a derandomization of the volume 
sampling algorithm in Section HI using the method of conditional expectations. Again, it may be 
easier to consider volume sampling extended to random A;-tuples (-'^i, . . . ,^fc) where 



Pr{Xi =ii,...,Xk = ik) 



det (A 



.,*fe}^{n,...,ifc} 



5C[m] : 151=*: 



det [AsA^] 



if ii, . . . , ifc are distinct 



otherwise 



From Theorem [2] we know that 



L4 - vr. 



{Xu...,Xk}i^)\\p 



< {k + l)\\A-Ak\\l, 



where the expectation is over {Xi, . . . , X/^). 

Let us consider ii, . . . , it-i for which Pr {Xi = ii, . . . , Xt-i = it^i) > 0. Let S = {ii, . . . , it-i} 
and look at the conditional expectation. The following lemma shows that these conditional expec- 
tations have an easy interpretation in terms of the coefficients of certain characteristic polynomials, 
and hence can be computed efficiently. 
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Lemma 21. Let (ii, . . . ,it~i) G [mY ^ be such that Pr (Xi = ii,. . . ,Xt-i = it^i) > for a random 
k-tuple {Xi, X2, . . . , Xk) from extended volume sampling. Let S = {ii, . . . , it~i} and B = A—irsi-^)- 
Then 



\A-^[x„...,x,}{A)\\ 



Xi — ii, . . . , Xt-i — it-i 



{k-t + 2)c^_k+t^2[BB 



Cm-k+t-1 



{BET) 



Proof. 
E 



11^- 7r{Xi,...,Xfe}(^)|lF I ^1 = n,...,^t-i = it~i 

^ ||^-7r{ii,...,i,}(A)||^Pr(Xi =ii,...,Xfc = ifc I Xi = n,...,Xt_i = U^i) 

(it,.-,«fc)eM''"'+^ 

|2 Pr {Xi = ii,. .. ,Xk = ik) 



E 

(it,...,ifc)GHfe-* + l 

(it,..,ife)eH'^-t+i l^{jt,...,jk)&[m] 



Em 
1=1 



^det(^{i^,...,i^}A[.^^ ^^^^ 



ife-t+ 



1 det A 



Hji,.-,«t-ijt,.-jfe}^{ii,...,ii_i,jt,...jfc} 



where L> = A - 7r{j^_...^j^}(A) 
E{it,...,jfc)6M''-'+' "^^^ (^{nv..,it-ijt,...jfc}^{ii,...,it_i,jt,...,ifc} 

(fc - t + 1)! J2TC[m] : |SUT|=fc,|r|=fc-t+l Sz^SUT ^et (^{OUSUT^{i}uSuT J 



(A;-i + l)!ETc 



[m] : \SUT\=k,\T\=k~t+l 



det (Asut-4^ljt) 



I]rC[m] : |5UT|=fc,|T|=fc-t+l Y^li^SUT det ( 



'{/}UT 



Etch : \suT\=k,\T\=k-t+i det (^5^5) det (St^J) 

Erc[m] : |5UT|=fc,|r|=fc-t+l Ei^SUT' 



by Lemma [12] 



' det (^{i}uT^{J}uT) 
Etch : \suT\=k,\T\=k-t+i det (BtB^) 
{k-t + 2) ErcH : |5uT|=fc+i,|r|=fc-f+2 det {BtB^) 

Etch : |5uT|=fc,|T|=fc-t+i det {BtB^) 
{k-t + 2) ErcH : \T\=k-t+2 det {BtB^) 

Etch : |r|=fc-t+i det {BtB^) 
{k-t + 2)\c^^k+tMBBT) 



the extra terms in the numerator and 
the denominator are zero 



\Cm-k+t-l{BB'^) 



by Lemma [TTl 



□ 



Knowing the above lemma, it is easy to derandomize our algorithm outlined for volume sam- 
pling. In each step, we just compute the new conditional expectations for each additional i, and 
finally pick the i that minimizes the conditional expectation. 
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Algorithm 4. Derandomized row/column-subset selection 

Input: a matrix A G M'"X" and 1 < A; < Tank{A). 
Output: a subset S oi k rows of A with the guarantee 

\\A-7rsiA)\\l<ik + l)\\A-Ak\\l. 

1. Initiahze 5^0 and i? ^ ^. For t = 1 to /c do: 

(a) For i = 1 to 77T, do: compute Cn-k+t-i{Cj Ci) and Cn-k+t{Cj Ci), where Ci = B — 
7r|j}(i?) is the matrix obtained by projecting each row of B orthogonal to hi. 

(b) Pick i that minimizes \cn-k+t-i{Cj Ci)\ / \cn~k+t{Cf Ci)\. Let S ^ 5 U {z} and 
B^Ci. 

2. Output S. 



Theorem 22 (same as Theorem [8]). Deterministic Algorithm^ when given a matrix A G ]^™x" 
and an integer 1 < k < rank(74), outputs a k-suhset S of the rows of A, using 0{kmn^ \ogn) 
arithmetic operations, such that 



\\A-TTs{A)\\p < Vk + 1\\A-Ak\\p 

\\A - 7rs{A)\\^ < y^{k + l){n-k)\\A-Ak\\^. 

Proof. By applying Lemma [2T] to S U {i} instead of S , as Ci = B — 7r|j}(i?) = A — 7r5u|j}(^), we 
see that the step t of our algorithm picks i that minimizes 

I 1 1 2 

1^ - 7r{Xi,...,Xfc}(^)|lF I ^1 = ^1' ■ ■ ■ '^i-l = H-l,Xt = 



(k-t + l) 


Cn-k+t-l{Cj Ci)\ 




Cn-~k+t{Cj Ci) 





[k-t + l) 


Cm-k+t-l{CiCf)\ 




Cm— 


k+t{CiCj) 





The correctness of our algorithm follows immediately from observing that in each step t, 
E - 7r{Xi,...,Xfc}(^)||^ I Xi =ii,...,Xt-i = it-i 

m 

= Y^Pr {Xt = i\ Xi=ii,...,Xt-i = it-i)E\\\A - 7T{Xu...,x^}iM\l I Xi = ii, . . . , Xt-i = it-i, Xt 



i=l 

and that we started with 



A - 7r^x,,...,x,}{A)\\l < {k + 1) \\A - Akn . 



2 



The guarantee for spectral norm follows immediately from our guarantee in the Frobenius norm, 
just using properties of norms and the fact that rank(^ — Ak) < n — k: 

\\A - 7Ts{A)\\l < \\A - 7Ts{A)\\l <{k + 1)\\A - AkWl < + " " AkWl 

Moreover, this algorithm runs in time 0{kmn^ logn) if we use the subroutine in Subsection 14.11 
to compute the characteristic polynomial of CiCf using that of Cfd. □ 
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6 Lower bound for rank-1 spectral approximation using one row 



Here we show a lower bound for row/column-subset selection. We prove that there is a matrix 
A G such that using the span of any single row of it, we can get only r2(-y/n)-approximation 

in the spectral norm for the nearest rank-1 matrix to A. This can be generalized to a similar i}{^/n) 
lower bound for general k by using a matrix with k block-diagonal copies of A. 

Theorem 23 (same as Theorem 1 1011 . There exists a matrix A G such that 



\\A - TT{iy{A)\\^ = ^l{^/n) \\A - Ai\\2 , for all 1 < i < n, 

where 'K^^y{A) G is the matrix obtained by projecting each row of A onto the span of its 

i-th row Oj. 

Proof. Consider A G M"^("+^) with entries as follows: 



/ 1 e 

1 e 

1 

VI ... 



... 0\ 

. . . 

. . . 

6 / 



< e < 1. 



Let B be the best rank-1 approximation to A whose rows lie in the span of (l,e, 0, 
that matter, any fixed row of A). Then, we want to show that 



, 0) (or for 



\A-B\ 



> 



n 



\A- A 



lib 



fT2(A). 



We first compute the singular values of ^, i.e., the positive square roots of the eigenvalue of 
AA^ G ffi'^^'^. 



/ l + e2 



AA' 



\ 



1 + e^ 



1 \ 
1 

1 1 

l + €^ 1 

1 1 + eV 



(1, 1, . . . , 1) is an eigenvector of AA^ with eigenvalue n + e^. Thus, cri[A) = -v/n+~e^. Observe 
that, by symmetry, all other singular values of A must be equal, i.e., a2{A) = a^{A) = . . . = an{A). 
However, 



n 



+ ne^ = y a,{Af = ai{Af + (n - 1)^2(^)2 = n + + (n - l)a2{Af 



1=1 



Therefore, \\A — AiW^ = (72(^4) = e. 

Now denote the i-th row of A by Oj. By definition, the i-th row of B is the projection of onto 
span (ai). We are interested in the singular values of ^ — i?. For i > 2: 



-ai 



l«i| 



1 + e2 ' 1 + e- 



■,o. 



,0,...,0 



{i -\- l)-th coord. 
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Thus, {A - B){A- B)'^ £ M"^" can be written as 



{A- B){A- B)'^ 



( 





V 





.2 



1+^ 



l+e2 



Again, (0, 1, 1, . . . , 1) is the top eigenvector of {A — B){A — B) and using this we get. 



\A-B\\l = ai{A-B) 



e\2 + e' 
l + e2 



+ (n - 2) 



1 + 



Therefore, 



B\ 



2 



2 ■ 



□ 



7 Discussion 

We analyzed efficient algorithms for volume sampling that can be used for row/column subset 
selection. Here are some ideas for future investigation suggested by this work: 

• It would be interesting to explore how these algorithmic ideas are related to determinantal 
sampling [161 [T3] and, in particular, the generation of random spanning trees. 

• Find practical counterparts of the algorithms discussed here. In particular, we do not analyze 
the numerical stability of our algorithms. 

• Is there an efficient algorithm for volume sampling based on random walks? This question is 
inspired by MCMC as well as random walk algorithms for the generation of random spanning 
trees. 
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